Abstract

We present a framework for a large-scale distributed eScience Artificial Intel- 
ligence search. Our approach is generic and can be used for many different 
problems. Unlike many other approaches, we do not require dedicated ma- 
chines, homogeneous infrastructure or the ability to communicate between 
nodes. We give special consideration to the robustness of the framework, 
minimising the loss of effort even after total loss of infrastructure, and allowing
easy verification of every step of the distribution process. In contrast to
most eScience applications, the input data and specification of the problem are
very small and can easily be given in a paragraph of text. The unique challenges
our framework tackles are related to the combinatorial explosion of the space 
that contains the possible solutions and the robustness of long-running com- 
putations. Not only is the time required to finish the computations unknown, 
but also the resource requirements may change during the course of the com- 
putation. We demonstrate the applicability of our framework by using it to 
solve a challenging and hitherto open problem in computational mathemat- 
ics. The results demonstrate that our approach easily scales to computations 
of a size that would have been impossible to tackle in practice just a decade 
ago. 
1. Introduction

The last decade has seen an unprecedented rise in the computing power
that institutions and even individuals have access to. This is true not only
of individual processors, but also of the number of processors and machines.
During the last few years, a dramatic paradigm shift from ever faster processors
to an ever increasing number of processors and processing elements has
occurred. Even basic contemporary machines have several generic processing
elements and specialised chips for tasks such as graphics processing.

The size of problems people are interested in solving and the amount of 
data that needs to be processed in order to do that has grown dramatically 
as well. Today, amounts of data are routinely processed that could not even 
have been stored a decade ago. All this presents computer science with new 
and challenging research directions. 

The processing of so-called "big data" is one of the directions where a lot
of research has been done and a lot of tools have been developed. Applications
can be scaled across hundreds of machines relatively easily. The situation
in many areas of Artificial Intelligence is completely different, however.
Distributing problems across several machines was a research endeavour
long before the advent of easily accessible computational resources and big
data. Solving AI problems of practical relevance has always required a large
amount of computational resources.

Considering the keen interest of AI researchers in parallelisation, it is
somewhat paradoxical that frameworks to distribute AI techniques are still
in their infancy when it comes to practical applications. One example of such
a framework is Apache Mahout [1], which leverages the generic Hadoop framework to
distribute Machine Learning algorithms. For AI search on the other hand,
there are, to the best of our knowledge, no similar frameworks.

Artificial Intelligence search has close links with eScience research, being
used to plan workflows [2], identify optimal protein and DNA structures [3, 4],
and obtain qualitative models of dynamic systems arising in a wide range
of scientific areas [5, 6, 7, 8, 9].

AI search involves the efficient creation, exploration and pruning of very 
large search trees (for the game of chess, the tree has an estimated $10^{47}$
nodes). In many cases it is acceptable to find the first solution from many
candidates, or accept sub-optimal solutions with respect to a cost function 
to limit the amount of search performed. However, we often require either all 
solutions to a given problem, or a solution that has a guarantee of optimality. 

Even when only the first solution is required, the time to find it can 
quickly grow to days, months or even years on a single computer. In most 
cases, this is unacceptable - we must be able to find a solution in less time. 
There are two strategies for achieving this. The AI search techniques can be
improved to be more efficient for the problem, or the search can be distributed
across several machines such that the time to find a solution decreases without
actually decreasing the total effort. The framework presented in this
paper pursues the latter strategy.

Our requirements for such a framework can be summarised as follows. 

• Scalability. 

We want to be able to use as many resources as possible at the same 
time, regardless of type and location and with minimal connectivity 
requirements. 

• Robustness. 

The framework must be able to cope with hardware and similar failures. 
In particular, the amount of computational effort lost because of such 
an event should be small. 

• Verifiability. 

In order to be useful for solving open problems, we must be able to 
follow each step in the distribution process to verify that AI search 
proceeded correctly and no solutions were lost. 

In this paper we describe a framework that fulfils these requirements. The
design and implementation are motivated by the Recovery Oriented Computing
[10, 11] aspects of the much wider research into Ultralarge systems [12].
The AI search undertaken is Constraint Programming, described in Section 1.1.
This is not a restriction, as most AI search problems can be expressed
as Constraint Programming problems. The application area that we
use to evaluate the implementation of the framework is described in Section 1.2.

1.1. Constraint Programming 

Constraints are a natural and compact way of representing problems that 
are ubiquitous in everyday life. Constraint Programming investigates tech- 
niques for solving problems that involve constraints. Common application 
domains include other areas of Artificial Intelligence such as planning, but 
also real world and industrial applications such as scheduling, design and 
configuration or diagnosis and testing. Wallace [13] gives an early overview 
of application areas. 
Formally, a constraint problem is a triple $(X, \mathcal{D}, C)$, where $X$ is a finite
indexed set of variables $x_1, x_2, \ldots, x_n$. Each variable $x_i$ has a finite domain of
possible values $D_i \in \mathcal{D}$. The set $C$ is a finite set of constraints on the variables
in $X$. A constraint is a relation that restricts the values of the variables in
its scope. A solution to a constraint problem is a complete assignment of
values from the respective domains to all variables $x_i \in X$ such that none of
the constraints $c_j \in C$ is violated.

In constraint programming, a distinction is usually made between con- 
straint satisfaction problems (CSPs) and constrained optimisation problems 
(COPs). A solution to the former only has to satisfy all the constraints, 
whereas a solution to the latter is also given a score by a cost function that 
needs to be optimised. As such, it is usually not sufficient to find only the 
first solution of a COP even if only one solution is required unless this first 
solution can be shown to be optimal. In the remainder of this paper we 
consider, without loss of generality, CSPs. 

Constraint problems are typically solved by building a search tree in which 
the nodes are assignments of values to variables and the edges lead to assign- 
ment choices for the next variable. If at any node a constraint is violated, 
search backtracks by returning to a previous state. If a leaf is reached and 
no constraints are violated, all variables have been assigned values and this 
set of assignments denotes a solution to the CSP. 

Clearly the search trees are exponential in the number of variables. Exploring
all of them is infeasible in many cases, so inference is used at each
node of the search tree to prune values from the domains of unassigned variables
that cannot be part of a solution based on the assignments made so
far. Inference also allows search to backtrack before a constraint is violated: if the
domain of a particular variable becomes empty, the set of assignments made
so far cannot be part of a solution.
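
The interplay of search and inference described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it uses the n-queens puzzle as a stand-in problem, forward checking as the (deliberately simple) form of inference, and the names `conflicts`, `search` and `queens` are hypothetical. It is not how Minion or any other solver is implemented.

```python
def conflicts(col_a, row_a, col_b, row_b):
    """Two queens attack each other if they share a row or a diagonal."""
    return row_a == row_b or abs(row_a - row_b) == abs(col_a - col_b)


def search(domains, assignment=None):
    """Yield every solution by depth-first search with forward checking.

    `domains` maps each variable (a column) to the set of values (rows)
    that are still possible for it.
    """
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        yield dict(assignment)  # leaf reached without a violated constraint
        return
    var = next(v for v in sorted(domains) if v not in assignment)
    for value in sorted(domains[var]):
        pruned = {var: {value}}
        wiped_out = False
        # Inference: remove values of unassigned variables that clash
        # with the assignment var = value.
        for other in domains:
            if other == var or other in assignment:
                continue
            keep = {w for w in domains[other]
                    if not conflicts(var, value, other, w)}
            if not keep:  # empty domain: backtrack before any violation
                wiped_out = True
                break
            pruned[other] = keep
        if wiped_out:
            continue
        assignment[var] = value
        yield from search({**domains, **pruned}, assignment)
        del assignment[var]


def queens(n):
    """All solutions of the n-queens puzzle, one row value per column."""
    return list(search({col: set(range(n)) for col in range(n)}))
```

Calling `queens(4)` enumerates the solutions of the 4-queens puzzle; branches in which some column has no remaining row are abandoned as soon as the domain wipe-out is detected.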

The inference checks have a computational cost and the trade-off is between
the effort of making checks - hopefully resulting in a reduction of the
search space - and the effort of searching a presumably larger tree but at a
lower cost per node. This is an area of active research and the Handbook of
Constraint Programming [14] provides more details on the many techniques
that can be used to solve constraint problems.

Constraint problems are often highly symmetric. Symmetries may be in- 
herent in the problem or be created in the process of representing the problem 
as a CSP. A symmetry can be as simple as being able to swap the assign- 
ments of two variables in every solution or involve complex permutations of 
the assignments. In general, it is desirable to rule out the symmetries during
search. This often leads to a massive reduction in the search space while the 
solutions that have been ruled out can be recovered after the problem has 
been solved at a low computational cost. 

The process of removing symmetries is referred to as symmetry breaking. 
It introduces additional constraints that are redundant with respect to the 
original problem specification, but rule out symmetrical solutions. More 
details on symmetries and symmetry breaking techniques can again be found 
in the Constraint Programming Handbook [13]. 



Table 1: A Semigroup of order 10. 
1.2. Semigroups

We apply our framework to finding the semigroups of order 10. A semigroup
$T = (S, *)$ consists of a set of elements $S$ and a binary operation
$* : S \times S \to S$ that is associative, satisfying $(x * y) * z = x * (y * z)$ for
each $x, y, z \in S$. Table 1 is an illustrative example of such an object. Given
a permutation $\pi$ of the elements of $\{0, \ldots, 9\}$, a semigroup isomorphic to
$T$ is obtained by permuting the rows, the columns, and finally the values
according to $\pi$. An anti-isomorphism is the transpose of an isomorphism.






The problem addressed in this paper is finding, up to symmetric equivalence
(i.e. up to isomorphism or anti-isomorphism), all ways of filling in a blank
table such that multiplication is associative. For orders less than 10, this problem
can be solved by a combination of enumeration formulae and computation on
a single processor. Table 2, with entries taken from sequence A001423 of the
On-Line Encyclopaedia of Integer Sequences, demonstrates the combinatorial
growth in the number of solutions with increasing order, and motivates the
use of multiple compute nodes to explore the solution space. The table for
the semigroups of order $n$ has $n^2$ cells, and each of these can take any one
of $n$ values. Hence the size of the search space for order $n$ is $n^{n^2}$. For the problem
under consideration, $n = 10$, the size of the search space is $10^{100}$. To put this
number into context, it is currently estimated that there are approximately
$10^{80}$ atoms in the universe. The search space for our problem is so vast that
we cannot possibly hope to solve it by brute force search.
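
For very small orders the problem statement can be checked directly by brute force. The following Python sketch is our own illustration, not the method used in this paper: it enumerates all $n^{n^2}$ tables, keeps the associative ones, and identifies tables that are isomorphic or anti-isomorphic by mapping each to the lexicographically smallest member of its equivalence class. The function names are hypothetical, and the approach is hopeless beyond $n = 3$ or so, which is exactly the point made above.

```python
from itertools import permutations, product


def is_associative(table):
    """Check (a*b)*c == a*(b*c) for all triples of elements."""
    n = len(table)
    return all(table[table[a][b]][c] == table[a][table[b][c]]
               for a in range(n) for b in range(n) for c in range(n))


def canonical(table):
    """Smallest table among all isomorphic and anti-isomorphic copies."""
    n = len(table)
    best = None
    for perm in permutations(range(n)):
        inverse = [0] * n
        for source, target in enumerate(perm):
            inverse[target] = source
        # Relabel rows, columns and values according to perm.
        iso = tuple(tuple(perm[table[inverse[x]][inverse[y]]]
                          for y in range(n)) for x in range(n))
        anti = tuple(zip(*iso))  # transpose gives the anti-isomorphic copy
        for candidate in (iso, anti):
            if best is None or candidate < best:
                best = candidate
    return best


def count_semigroups(n):
    """Number of semigroups of order n up to (anti-)isomorphism."""
    classes = set()
    for cells in product(range(n), repeat=n * n):
        table = tuple(cells[r * n:(r + 1) * n] for r in range(n))
        if is_associative(table):
            classes.add(canonical(table))
    return len(classes)
```

For $n = 3$ this examines $3^9 = 19{,}683$ candidate tables; the counts for orders 2 and 3 can be compared against the corresponding entries of sequence A001423.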

Table 2: Number of semigroups of order n, considered to be equivalent when they are
isomorphic or anti-isomorphic

Order n    Number of semigroups
5          1,160
6          15,973
7          836,021
8          1,843,120,128
9          52,989,400,714,478

Recent advances in the theory of finite semigroups have led to an enumer- 
ative formula [15] that gives the number of 'almost all' semigroups of given 
order. Despite this, 256,587,290,511,904 non-equivalent solutions had to be 
found using the framework described in this paper. 

The constraint model of semigroups of order 10 makes extensive use of
the element constraint on natural numbers $N$, $M_0, \ldots, M_{n-1}$ and $P$

$$N = (M_0, \ldots, M_{n-1})[P]$$

which requires that $N$ is the $P$th element of the list $(M_0, \ldots, M_{n-1})$ in any
solution. This constraint is implemented in many CSP solvers, including the
one developed in our group, Minion [16].
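
To make the semantics of the element constraint concrete, here is a small Python sketch that checks it on fixed values rather than propagating it over domains as a solver would. It also shows one way, assumed here for illustration, in which associativity of a multiplication table can be phrased with two element checks per triple; the exact model used for order 10 is given in [17]. The names `element_holds` and `associative_via_element` are hypothetical.

```python
def element_holds(n_value, m_values, p_value):
    """True if n_value is entry number p_value (counting from 0) of m_values."""
    return 0 <= p_value < len(m_values) and m_values[p_value] == n_value


def associative_via_element(table):
    """Check associativity of a square table using only element checks."""
    n = len(table)
    for a in range(n):
        for b in range(n):
            for c in range(n):
                column_c = [table[row][c] for row in range(n)]
                # (a*b)*c: index column c by the value of a*b.
                product_abc = column_c[table[a][b]]
                # a*(b*c): index row a by the value of b*c.
                if not element_holds(product_abc, table[a], table[b][c]):
                    return False
    return True


# Addition modulo 3 is associative, subtraction modulo 3 is not.
ADD_MOD_3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]
SUB_MOD_3 = [[(a - b) % 3 for b in range(3)] for a in range(3)]
```

With these definitions, `associative_via_element(ADD_MOD_3)` holds while `associative_via_element(SUB_MOD_3)` does not.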

We let $X_1 = \{T_{a,b} \mid 0 \le a, b \le 9\}$ be variables representing the entries
in a $10 \times 10$ multiplication table $T$, and $X_2 = \{A_{a,b,c} \mid 0 \le a, b, c \le 9\}$ the
variables representing each of the products of three elements. Our basic CSP
contains the variables $X = X_1 \cup X_2$, each with domain $D = \{0, \ldots, 9\}$. For
each triple $(a, b, c)$ of values from $D$, we post the pair of constraints



$$A_{a,b,c} = (T_{0,c}, \ldots, T_{9,c})[T_{a,b}]$$

$$A_{a,b,c} = (T_{a,0}, \ldots, T_{a,9})[T_{b,c}]$$

which enforce associativity. We rule out search for semigroups given by a
formula by posting constraints that require at least one of the
variables in $X_2$ to be assigned a non-zero value. A full description of the CSP model and its
reduction into case-splits is given in [17].

Finding all solutions of this CSP solves our problem apart from ruling out
symmetric equivalents. Our symmetry group is the set of permutations of
$\{0, \ldots, 9\}$ combined with possible transpositions of the tables. If $g = (\pi, \phi)$
is such a symmetry and $T$ is a multiplication table, then $T^g$ is the table
obtained by first permuting the rows and columns of $T$ according to $\pi$, and
either transposing the table or doing nothing, depending on $\phi$.

We ensure that only canonical solutions are returned by identifying the
symmetry group using the GAP computational algebra package [18], then
posting "lex-leader" symmetry-breaking constraints before search. This is a
well-known technique for dealing with symmetries in CSPs [19, 20], made
harder to implement in our case because our symmetries involve both variables
and values, and made harder to deploy because we need to post up to
$2 \times 10! = 7{,}257{,}600$ symmetry-breaking constraints.

2. Related work 

The parallelisation of depth-first search has been the subject of much
research in the past. The first papers on the subject study the distribution
over various specific hardware architectures and investigate how to achieve
good load balancing [21, 22]. Distributed solving of constraint problems
specifically was first explored only a few years later [23].

Backtracking search in a distributed setting has also been investigated
by several authors [24, 25]. A special variant for distributed scenarios,
asynchronous backtracking, was proposed in [26]. Yokoo et al. formalise the
distributed constraint satisfaction problem and present algorithms for solving
it [27].

Schulte presents the architecture of a system that uses networked computers
[28]. The focus of his approach is to provide a high-level and reusable
design for parallel search and achieve a good speedup compared to sequential
solving rather than good resource utilisation. More recent papers have
explored how to transparently parallelise search without having to modify
existing code [29].

Most of the existing work is concerned with the problem of effectively 
distributing the workload such that every compute node is kept busy. The 
most prevalent technique used to achieve this is work stealing. The compute 
nodes communicate with each other and nodes which are idle request a part 
of the work that a busy node is doing. Blumofe and Leiserson propose and
discuss a work stealing scheduler for multithreaded computations in [30].
Rolf and Kuchcinski investigate different algorithms for load balancing and
work stealing in the specific context of distributed constraint solving [31].

Several frameworks for distributed constraint solving have been proposed 
and implemented, e.g. FRODO [32], DisChoco [33] and Disolver [34]. All
of these approaches have in common that the systems to solve constraint 
problems are modified or augmented to support distribution of parts of the 
problem across and communication between multiple compute nodes. The 
constraint model of the problem remains unchanged however; no special con- 
structs have to be used to take advantage of distributed solving. All par- 
allelisation is handled in the respective solver. This does not preclude the 
use of an entirely different model of the problem to be solved for the dis- 
tributed case in order to improve efficiency, but in general these solvers are 
able to solve the same model both with a single executor and distributed 
across several executors. 

The decomposition of constraint problems into subproblems which can be 
solved independently has been proposed in [35], albeit in a different context. 
In this work, we explore the use of this technique for parallelisation. A similar 
approach was taken in [31], but requires parallelisation support in the solver.
3. Distributing CSPs



Our approach to parallelising the solving of constraint problems has been
described previously. This paper updates the description and, crucially,
reports results from an application of the framework.

Constraint problems are typically solved by searching through the possi- 
ble assignments of values to variables. After each such assignment, inference 
can rule out possible future assignments based on past assignments and the 
constraints. This process builds a search tree that explores the space of 
possible (partial) solutions to the constraint problem. 

There are two different ways to build up these search trees - n-way branch- 
ing and 2-way branching. This refers to the number of new branches which 
are explored after each node. In n-way branching, all the n possible assign- 
ments to the next variable are branched on. In 2-way branching, there are 
two branches. The left branch is of the form $x = y$ where $x$ is a variable and
$y$ is a value from its domain. The right branch is of the form $x \neq y$.

The more commonly used way is 2-way branching, implemented for example
in the Minion constraint solver [16], available at http://minion.sf.net.
However, regardless of the way the branching is done, exploring the branches 
can be done concurrently. No information between the branches needs to be 
exchanged in order to find a solution to the problem. 

We exploit this fact by, given the model of a constraint problem, gener- 
ating new models which partition the remaining search space. These models 
can then be solved independently. We furthermore represent the state of the 
search by adding additional constraints such that the splitting of the model 
can occur at any point during search. The new models can be resumed, tak- 
ing advantage of both the splitting of the search space and the search already 
performed. 

3.1. Model splitting 

Our new approach to the distributed solving of constraint problems re- 
quires the constraint solver to modify the constraint model but does not 
require explicit parallelisation support in the solver. 

To split the remaining search space of a constraint problem, we signal
the solver to stop. We then partition the domain of the variable currently
under consideration into $n$ pieces of roughly equal size. Then we create $n$
new models and to each in turn add constraints ruling out the other $n - 1$
partitions of that domain. Each one of these models restricts the possible
assignments to the current variable to one $n$th of its domain.

As an example, consider the case $n = 2$. The variable under consideration
is $x$ and its domain is $\{1, 2, 3, 4\}$. We generate 2 new models. One of them
has the constraint $x \le 2$ added and the other one $x \ge 3$. Thus, solving the
first model will try the values 1 and 2 for $x$, whereas the second model will
try 3 and 4.
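
The partitioning step can be sketched as follows. This is a minimal Python illustration under our own assumptions: the function name `split_domain` is hypothetical, and each piece is represented simply by the pair of bounds that the added constraints would impose on the variable.

```python
def split_domain(domain, n):
    """Partition a domain into at most n contiguous pieces of similar size.

    Returns one (low, high) pair per piece; a new model would add the
    constraints low <= x and x <= high for its piece.
    """
    values = sorted(domain)
    size, extra = divmod(len(values), n)
    bounds, start = [], 0
    for piece in range(n):
        # The first `extra` pieces get one additional value each.
        end = start + size + (1 if piece < extra else 0)
        if end > start:
            bounds.append((values[start], values[end - 1]))
        start = end
    return bounds
```

For the example in the text, `split_domain({1, 2, 3, 4}, 2)` yields the bounds `(1, 2)` and `(3, 4)`, corresponding to the constraints $x \le 2$ and $x \ge 3$.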

The main problem when splitting constraint problems into parts that can 
be solved in parallel is that the size of the remaining search space for each 
of the splits is impossible to predict reliably. This directly affects the effec- 
tiveness of the splitting however - if the search space is distributed unevenly, 
some of the workers will be idle while the others do most of the work. 

Our approach allows the search space to be split repeatedly after search
has started. We use the procedure described above several times, each time
adding more constraints to the model. In addition, we add restart nogoods,
that is, additional constraints that tell the solver how much of the search
space has been explored. Constraints added in a previous iteration are not
affected by constraints added later: regardless of how often we split, no part
of the search space will be "lost", potentially missing solutions. Similarly, no
part of the search space will be visited repeatedly.

Assume for example that we are doing 2-way branching, the variable
currently under consideration is again $x$ with domain $\{1, 2, 3, 4\}$ and the
branches that we have taken to get to this point are $x \neq 1$ and
$x \neq 2$. The generated new models will all have the constraints $x \neq 1$ and
$x \neq 2$ to get to the point in the search tree where we split the problem. Then
we add constraints to partition the search space based on the remaining
values in the domain of $x$, similar to the previous example. The splitting
process and subsequent parallel search is illustrated in Figure 1.
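
A split after search has started can be sketched in the same spirit. In the following Python illustration, which rests on our own assumptions, a model is just a list of constraint tuples, the values already refuted on the current variable become disequality constraints shared by all new models, and the remaining values are divided among the new models. The tuple encoding and the name `split_model` are hypothetical and unrelated to Minion's input format.

```python
def split_model(base_constraints, var, domain, refuted, n=2):
    """Create up to n new models that partition the unexplored values of var.

    Every new model keeps the original constraints, repeats the branches
    already taken (var != v for each refuted v), and restricts var to its
    own share of the remaining values.
    """
    refuted = set(refuted)
    nogoods = [("!=", var, v) for v in sorted(refuted)]
    remaining = [v for v in sorted(domain) if v not in refuted]
    if not remaining:
        return []
    step = -(-len(remaining) // n)  # ceiling division
    return [list(base_constraints) + nogoods
            + [("in", var, tuple(remaining[k:k + step]))]
            for k in range(0, len(remaining), step)]
```

For the example above, `split_model([], "x", [1, 2, 3, 4], [1, 2])` produces two models that both contain $x \neq 1$ and $x \neq 2$; one restricts $x$ to 3 and the other to 4, so no value is lost and none is covered twice.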

Using this technique, we can create new chunks of work whenever a worker 
becomes idle by simply asking one of the busy workers to split the search 
space. The search is then resumed from where it was stopped and the re- 
maining search space is explored in parallel by the two workers. Note that 
there is a runtime overhead involved with stopping and resuming search be- 
cause the constraints which enable resumption must be taken into account 
and the solver needs to explore a small number of search nodes to get to the 
point where it was stopped before. There is also a memory overhead because 
the additional constraints need to be stored. 

We have implemented this approach in a development version of Minion,
which we are planning to release to the public. Experiments show that
the overhead of stopping, splitting and resuming is not significant for large
problems.

Figure 1: Illustration of how search proceeds with splitting. The dashed and dotted
line shows the search up to the solid black node where the models are split. The nodes
the two parallel searches explored subsequently are shown with dashed and dotted lines,
respectively.

In practice, we run Minion for a specified amount of time, then stop, split
and resume instead of splitting at the beginning and when workers become
idle. This approach is much simpler and works well for large problems. The
algorithm is detailed in Procedure distSolve. It creates an n-ary split tree of
models for n new models generated at each split. The procedure for finding
all solutions is similar. Initially, the potential for distribution is small but
grows exponentially as more and more search is performed. We have found
that n = 2 works well in practice because it is the easiest to implement and
minimises the number of models created.
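
The control flow of Procedure distSolve can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the "solver" `run_solver` merely enumerates candidate assignments, the time limit is modelled as a budget on the number of candidates checked, a model is a list of domains, and the sub-models are processed one after another rather than in parallel. Values of the branching variable whose subtrees were fully explored before the budget ran out are removed from the returned model, which plays the role of the constraints ruling out search already performed.

```python
from itertools import product
from math import ceil, prod


def branch_index(domains):
    """Index of the first variable with more than one value, or None."""
    return next((i for i, d in enumerate(domains) if len(d) > 1), None)


def run_solver(domains, ok, budget):
    """Toy stand-in for one time-limited solver run.

    Returns ("solved", assignment), ("exhausted", None) or
    ("timeout", model_without_the_values_already_explored).
    """
    if any(len(d) == 0 for d in domains):
        return "exhausted", None
    i = branch_index(domains)
    if i is None:  # every variable is fixed: check the single candidate
        candidate = tuple(d[0] for d in domains)
        return ("solved", candidate) if ok(candidate) else ("exhausted", None)
    remaining, checked = list(domains[i]), 0
    for value in domains[i]:
        sub = domains[:i] + [[value]] + domains[i + 1:]
        size = prod(len(d) for d in sub)
        if checked + size > budget:  # the allotted "time" is used up
            return "timeout", domains[:i] + [remaining] + domains[i + 1:]
        for candidate in product(*sub):
            if ok(candidate):
                return "solved", candidate
        checked += size
        remaining.remove(value)  # this value is now fully explored
    return "exhausted", None


def split(domains, n):
    """Split the branching variable's domain into at most n pieces."""
    i = branch_index(domains)
    if i is None:
        return [domains]
    step = ceil(len(domains[i]) / n)
    return [domains[:i] + [domains[i][k:k + step]] + domains[i + 1:]
            for k in range(0, len(domains[i]), step)]


def dist_solve(domains, ok, budget, n=2):
    """Find a first solution by repeated run, split and recurse."""
    status, payload = run_solver(domains, ok, budget)
    if status == "solved":
        return payload
    if status == "exhausted":
        return None
    for part in split(payload, n):  # in the framework: in parallel
        solution = dist_solve(part, ok, budget, n)
        if solution is not None:
            return solution
    return None
```

As a usage example, `dist_solve([list(range(6))] * 3, lambda c: c[0] < c[1] < c[2] and sum(c) == 12, budget=10)` searches three variables with domain 0 to 5 under a budget far smaller than the 216 candidates, and still finds the assignment because each timeout produces smaller models.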

Minion models are stored in ordinary files. Each time the search space is 
split, two new input files are written. We modified the output produced by 
Minion to include the names of the files it produced and included the name 
of the file that was run when the search space was split in the new model 
files. This way, we can easily trace the splitting of the search space across 
the different files. 
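
The kind of tracing this enables can be sketched as follows. The Python illustration assumes that the parent recorded in each model file has already been read into a dictionary mapping a model file name to the name of the file it was split from; the file names and the functions `lineage` and `incomplete_splits` are hypothetical.

```python
def lineage(model, parent_of):
    """Return the chain of model files from the original model to `model`."""
    chain = [model]
    while parent_of.get(chain[-1]) is not None:
        chain.append(parent_of[chain[-1]])
    return chain[::-1]


def incomplete_splits(parent_of, n=2):
    """Return models whose split did not leave exactly n recorded children."""
    counts = {}
    for child, parent in parent_of.items():
        if parent is not None:
            counts[parent] = counts.get(parent, 0) + 1
    return sorted(p for p, c in counts.items() if c != n)


# Hypothetical record of two successive splits of an original model.
PARENT_OF = {
    "root.minion": None,
    "root.0.minion": "root.minion",
    "root.1.minion": "root.minion",
    "root.1.0.minion": "root.1.minion",
    "root.1.1.minion": "root.1.minion",
}
```

Here `lineage("root.1.0.minion", PARENT_OF)` walks back to the original model, and an empty result from `incomplete_splits(PARENT_OF)` indicates that every recorded split produced both of its model files.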

3.2. Comparison to existing approaches 

The main advantages of our approach are as follows. 






Input: constraint problem X, allotted time T_max and splitting factor n >= 2
Output: a solution to X or "no solution" if no solution has been found

run the constraint solver with input X until termination or T_max;
if solved?(X) then
    terminate workers;
    return solution;
else if search space exhausted? then
    return "no solution";
else
    X' <- X with new constraints ruling out search already performed;
    split X' into n parts X'_1, ..., X'_n;
    for i <- 1 to n do in parallel
        distSolve(X'_i, T_max, n);
    end
end

Procedure distSolve(X, T_max, n): Recursive procedure to find the first solution
to a constraint problem distributed across several workers.
• We require only minimal modifications to existing constraint solvers. In
particular, we do not require network communication and work stealing 
to be implemented. 

• We do not require communication between workers to achieve good 
utilisation. 

• The creation of separate model files when splitting increases the ro- 
bustness against worker failure and provides accountability for every 
step. 

For the purposes of a framework for solving large Artificial Intelligence 
search problems, the last point is especially crucial. The nature of the ap- 
plications that we have in mind is such that it will be neither easy to verify 
whether a solution is valid nor feasible to repeat the calculations to get a 
confirmation. Furthermore, we have to be able to rely on the capability to 
recover from failures without having to repeat all the work. 

By creating regular "snapshots" of the search done, the resilience against 
failure increases. This is in contrast to most other approaches, where the 
reliability of the system is decreased by using techniques that distribute work 
and rely on several machines instead of just a single one. Such systems have 
then to take additional measures to mitigate the problems caused by failures 
of machines or communication links. Every time we split the search space, 
the modified models are saved. As they contain constraints that rule out the 
search already done, we only lose the work done after that point if a worker 
fails. This means that the maximum amount of work we lose in case of a total
failure of all workers is the allotted time $T_{max}$ times the number of workers
$|w|$.
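
As a rough worked example of this bound, assume purely for illustration an allotted time of $T_{max} = 1$ hour (a hypothetical value) together with the approximately 150 processors and 133 CPU years reported in Section 4:

$$W_{lost} \le T_{max} \cdot |w| = 1\,\text{h} \times 150 = 150 \text{ CPU hours}$$

$$\frac{150 \text{ CPU hours}}{133 \text{ years} \times 8766 \text{ h/year}} \approx \frac{150}{1.17 \times 10^{6}} \approx 1.3 \times 10^{-4}$$

Under these assumptions, even a simultaneous failure of every worker would cost on the order of a hundredth of a percent of the total effort.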

We note that our approach provides many of the advantages of efforts 
dedicated to improving the robustness and accountability of computations, 
e.g. [37], but is much easier to implement and only requires a minimal amount
of supporting infrastructure. 

Another consequence of our approach is that the solving process can be 
moved to a different set of workers after it has been started without losing 
any work. This may become necessary if parts of the problem require much 
more memory to solve than other parts. Instead of provisioning workers 
with a large number of resources for the entire duration of the computation, 
it becomes feasible to do this on-demand. This allows for excellent and easy 
integration with existing services that offer on-demand computing, such as a
cloud. 

3.3. Large-scale distribution 

In the previous sections, we have described the techniques that enable the 
distribution of the solving of a constraint problem across a set of workers, but 
not the system to take care of the actual distribution. The implementation 
of such a system is notoriously difficult, hence we decided to leverage a tried- 
and-tested existing system. 

For the purposes of a framework that allows problems to be distributed across
a large number of heterogeneous workers, the Condor HPC system is
particularly suitable. It runs in many different operating and network environments
and provides most of the functionality we require out of the box.
In particular, it allows for the transfer of files that are created on the worker 
back to the master - the constraint models that split the search space. 

Condor allows work units to be submitted to a central node which puts 
them in a queue to be executed when a worker becomes available. In our 
case, a constraint model is a unit of work and splitting the search space on 
one of the workers creates two new units of work that are transferred back 
to the master and queued for execution. The Condor job submission system
makes sure that a job is executed to completion, i.e. if a worker node fails 
while it is processing a work unit, Condor requeues the work unit and sends 
it to a different worker. 

Each Condor work unit needs to be created separately. In order to submit 
models that split the search space and are created during search, we have 
implemented a custom control system that monitors Condor and takes the 
appropriate action when split models are returned. The control system is an 
almost trivial piece of software that was very easy to implement - all of the 
heavy lifting is done by Condor. 
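
The logic of such a control system can be sketched as a simple queue-processing loop. The Python illustration below is an in-memory simulation under our own assumptions and does not use any Condor interface: `run_unit` stands in for executing one work unit on a worker, a `RuntimeError` simulates a worker failure, the requeueing that Condor performs is imitated inside the loop, and all names are hypothetical.

```python
from collections import deque


def control_loop(initial_unit, run_unit, max_attempts=3):
    """Process work units until the queue is empty and collect all results.

    run_unit(unit) returns (solutions_found, new_units_from_splitting).
    """
    queue = deque([(initial_unit, 0)])
    solutions = []
    while queue:
        unit, attempts = queue.popleft()
        try:
            found, split_units = run_unit(unit)
        except RuntimeError:
            # Simulated worker failure: put the same unit back in the queue.
            if attempts + 1 >= max_attempts:
                raise
            queue.append((unit, attempts + 1))
            continue
        solutions.extend(found)
        # Split models returned by the worker become new work units.
        queue.extend((new_unit, 0) for new_unit in split_units)
    return solutions


def make_worker(fail_once=()):
    """Toy worker: a unit is a range (lo, hi); solutions are multiples of 7."""
    pending_failures = set(fail_once)

    def run_unit(unit):
        lo, hi = unit
        if unit in pending_failures:
            pending_failures.discard(unit)
            raise RuntimeError("simulated worker failure")
        if hi - lo > 4:  # too large for one run: split into two units
            mid = (lo + hi) // 2
            return [], [(lo, mid), (mid, hi)]
        return [v for v in range(lo, hi) if v % 7 == 0], []

    return run_unit
```

Running `control_loop((0, 50), make_worker(fail_once={(0, 25)}))` returns every multiple of 7 below 50 even though one unit fails on its first attempt, because the failed unit is simply executed again from its saved description.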

While Condor is well suited to our needs, its installation is
not always straightforward. Ultimately, the scale of problems we are aiming
for might require not thousands of machines but tens of thousands. No
institution or even set of institutions has sufficient resources to make this 
available for a single project. Fortunately, the rise of the internet has fa- 
cilitated so-called volunteer computing, where interested users can "donate" 
compute time to a project of their choice. 
The best-known framework for such projects is BOINC, the Berkeley
Open Infrastructure for Network Computing [39]. It has been used for many
applications, including astrophysics, biology and mathematics. We have in- 
tegrated the Minion constraint solver with the BOINC framework in a way 
that allows for splitting the search space. This system provides many of 
the benefits of Condor but makes it much easier for non-technical users to 
contribute. 

[Figure 2 diagram. Labelled components: Submit machine, Condor master, Private cloud, Amazon cloud.]

Figure 2: Overview of resources used for the enumeration of semigroups. We used two
research clusters: a private cloud hosted in St Andrews and the Amazon cloud. One of
the research clusters was behind a NAT switch such that no machines on the outside could
reach it directly and all connections had to be initiated from within it.



4. Application and discussion 

We first validated our framework empirically by using it to compute the 
number of semigroups of order 9, a problem that had previously been solved 
using non-distributed search. We were able to confirm the known result on 
a number of different hardware configurations and splitting parameters, i.e. 
the time search is run before splitting the model. 
Encouraged by the results of these experiments, we started the calculation
of the number of semigroups of order 10. The hardware configuration
throughout the computations varied, but the principal resources we used are
shown in Figure 2. Here, one of the main advantages of our framework became
apparent. The different resources we used were located in different
networks that did not always have unrestricted connectivity to the other
nodes. One of the research group clusters, for example, was behind a NAT in
its own private network and unable to receive connections from outside this
network. We were still able to utilise the resources to their full extent.

The submit machine and the Condor master shown in Figure 2 were
not used for any of the computations, but only for the management of the 
calculations. It should be noted that there is no reason to have dedicated 
machines for those purposes as the resource requirements for the tasks they 
performed were very low. In principle, a machine used for management of 
the computations could also be used to perform computations itself. 

The maximum number of processors that we used in parallel at any one 
time was about 150. One of the reasons for using the Amazon cloud was 
that it turned out that the machines we had available locally did not have 
enough memory to explore some parts of the search space efficiently. We 
were able to move those calculations to virtual machines in the Amazon 
cloud with suitable specifications and seamlessly integrate the results of those 
computations with the rest. 

The total CPU time we expended to solve the problem (i.e. find exactly 
256,587,290,511,904 semigroups from 10^100 potential tables) was approximately 
133 years. This effort was achieved in approximately 18 months; 
full details of the mathematics and the case-splits used are described in [17]. 
The limiting factor was the resources that were available to us. Even though 
we did not start with a short search time before splitting, enough split mod- 
els to utilise all our resources were available after a few hours. For shorter 
computations, it might be desirable to facilitate faster splitting at the begin- 
ning to achieve good utilisation earlier, but for our purposes the framework 
as described previously was sufficient. The number of split models produced 
suggested that we could have utilised up to several thousand processors to a 
very high degree. 
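
As a rough plausibility check of these figures, the following sketch derives the 
average number of busy processors and the size of the raw search space. It 
assumes that the reported values (133 CPU years, 18 months) are exact and that 
the load was spread evenly over the whole period, neither of which is claimed 
above; the variable names are illustrative only. 

```python
CPU_YEARS = 133          # total CPU time reported above (approximate)
WALL_CLOCK_YEARS = 1.5   # approximately 18 months
ORDER = 10               # order of the semigroups being enumerated

# Average number of processors kept busy over the whole computation.
average_processors = CPU_YEARS / WALL_CLOCK_YEARS

# A multiplication table of order 10 has 10 * 10 cells, each taking one of
# 10 values, which gives 10^100 potential tables.
potential_tables = ORDER ** (ORDER * ORDER)

print(round(average_processors))
print(len(str(potential_tables)) - 1)  # exponent of the power of ten
```

Under these assumptions, an average of roughly 89 processors was busy, which is 
consistent with a peak of about 150 and with available resources, rather than 
available work, being the limiting factor. 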

The robustness of our framework proved useful several times during the 
computations. Events that we successfully coped with included power and 
network outages, air-conditioning failures, physical machines being switched 
off and virtual machines disappearing. The damage in terms of computational 
effort lost was very limited in all cases. Condor was able to recover 
from most of these failures without any manual intervention by simply re- 
queueing the failed jobs. The verification of the distribution process revealed 
that because of the re-queueing a small part of the search space had been 
explored several times, but we were able to isolate and discard the duplicate 
model and output files. 
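
The kind of check involved can be sketched as follows. This is a hypothetical 
illustration rather than the procedure we used: it assumes that every split 
model carries a unique identifier and that each output records a solution 
count, and the function name and record layout are invented for the example. 

```python
def discard_duplicates(results):
    """Keep one result per sub-problem and report which identifiers were repeated.

    `results` is an iterable of (subproblem_id, solution_count) pairs, as might
    be collected from the output files of re-queued jobs (assumed layout).
    """
    kept = {}
    duplicates = []
    for subproblem_id, solution_count in results:
        if subproblem_id in kept:
            # A re-run of the same sub-problem must reproduce the same count.
            if kept[subproblem_id] != solution_count:
                raise ValueError(f"conflicting counts for {subproblem_id}")
            duplicates.append(subproblem_id)
        else:
            kept[subproblem_id] = solution_count
    return kept, duplicates

# Hypothetical data: sub-problem "m1" was explored twice after a re-queue.
results = [("m1", 10), ("m2", 7), ("m1", 10), ("m3", 0)]
kept, duplicates = discard_duplicates(results)
total = sum(kept.values())
print(total, duplicates)
```

Raising an error on conflicting counts, instead of silently keeping the first, 
turns every duplicated job into an additional consistency check. 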

After the computations finished, we were able to verify each step of the 
distribution and solving process. Therefore, we are confident that the result 
we obtained is correct. Ultimately, certainty of correctness can only be 
established either by a new mathematical model that allows the number to be 
calculated directly, or by independent verification through a second 
computation. 

5. Conclusions and future work 

We have presented a framework for the large-scale distribution of AI 
search in constraint programming across resources with minimal network 
connectivity requirements. We have implemented this framework and applied 
the implementation to solving a hitherto open problem in computational 
mathematics. Throughout this application, the framework has 
fulfilled all our requirements. It is capable of scaling almost seamlessly to a 
large number of distributed and heterogeneous resources while minimising 
losses due to hardware failures. It furthermore provides the functionality to 
verify each step of the distribution process, creating confidence in the results. 

The type of our application is relatively rare in eScience. Instead of large 
amounts of data to process, we have a very concise problem specification that 
takes vast computational resources to solve. We believe that the nature of 
such problems presents unique challenges to eScience that have rarely been 
considered so far. 

There is no indication that the positive experiences we have had with the 
specific application described here are limited to that particular problem. We 
have, neither in the design of the framework nor its application, made any 
assumptions to that effect. We are currently evaluating the application of the 
framework to other problems that can be expressed as constraint problems 
and require large computational efforts. 

Apart from the application to new problems, an obvious avenue for future 
work that we would like to explore is the evaluation of the implementation 
of the framework that uses BOINC instead of Condor. An application to 
the same problem would allow us to not only judge the differences in terms 
of distribution effectiveness and utilisation, but also to independently verify the 
results that we have obtained. While we are confident that we would indeed 
obtain the same result, an empirical verification would eliminate any doubts 
about this aspect of the framework. 

We are planning to release as open source the modifications we have made 
to the Minion constraint solver in order to support splitting searches. Fur- 
thermore, we are intending to release all other components of the framework 
that are not already available to the public, thus enabling other researchers 
to tackle similarly large problems and providing a framework that we hope 
will prove useful to the research community. 